video synthesis
Plan-X: Instruct Video Generation via Semantic Planning
Huang, Lun, Xie, You, Xu, Hongyi, Gu, Tianpei, Zhang, Chenxu, Song, Guoxian, Li, Zenan, Zhao, Xiaochen, Luo, Linjie, Sapiro, Guillermo
Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and mis-alignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, which has its strength at synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning, together with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.
- North America > Mexico > Gulf of Mexico (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > United Kingdom > England > Staffordshire (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Echo-Path: Pathology-Conditioned Echo Video Generation
Muhammad, Kabir Hamzah, Elbatel, Marawan, Qin, Yi, Li, Xiaomeng
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data and, when used to augment real training sets, it improves downstream diagnosis of ASD and PAH by 7% and 8% respectively. Code, weights and dataset are available here.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York (0.04)
Semantic Communication-Enabled Cloud-Edge-End-collaborative Metaverse Services Architecure
Li, Yuxuan, Jinag, Sheng, Wang, Bizhu
With technology advancing and the pursuit of new audiovisual experiences strengthening, the metaverse has gained surging enthusiasm. However, it faces practical hurdles as substantial data like high-resolution virtual scenes must be transmitted between cloud platforms and VR devices. Specifically, the VR device's wireless transmission hampered by insufficient bandwidth, causes speed and delay problems. Meanwhile, poor channel quality leads to data errors and worsens user experience. To solve this, we've proposed the Semantic Communication-Enabled Cloud-Edge-End Collaborative Immersive Metaverse Service (SC-CEE-Meta) Architecture, which includes three modules: VR video semantic transmission, video synthesis, and 3D virtual scene reconstruction. By deploying semantic modules on VR devices and edge servers and sending key semantic info instead of focusing on bit-level reconstruction, it can cut latency, resolve the resource-bandwidth conflict, and better withstand channel interference. Also, the cloud deploys video synthesis and 3D scene reconstruction preprocessing, while edge devices host 3D reconstruction rendering modules, all for immersive services. Verified on Meta Quest Pro, the SC-CEE-Meta can reduce wireless transmission delay by 96.05\% and boost image quality by 43.99\% under poor channel condition.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California > Monterey County > Pacific Grove (0.04)
SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input
Lv, Zhen, Long, Yangqi, Huang, Congzhentao, Li, Cao, Lv, Chengfei, Ren, Hao, Zheng, Dian
Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie on the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining the spatio-temporal consistency between frames. Existing methods primarily address these issues by directly applying novel view synthesis (NVS) techniques to video, while facing limitations such as the inability to effectively represent dynamic scenes and the requirement for large amounts of training data. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth based Video Generation module DVG, which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a metric of stereo deviation strength and a Temporal Interaction Learning module TIL for geometric and temporal consistency ensurance respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)